HBO

Ashley Wright & Mubeena Wahaj

2022-05-02

Lights, camera, action!

Today, we’re going to take a deep dive into the world of HBO movies and TV shows, from iconic dramas like The Sopranos and Game of Thrones to the latest releases. HBO has been providing quality content to its viewers for decades, but have you ever wondered how it decides which shows to produce or which movies to acquire? That’s where the fascinating world of HBO data comes into play. By analyzing audience trends, ratings, and viewer demographics, HBO can make informed decisions about what to offer its loyal fans. So sit back, grab a snack, and get ready to explore the exciting world of HBO data.

Installing packages

#install.packages("ggrepel")
#install.packages("ggiraph") ## to use geom_tooltip
#install.packages("ggiraphExtra")
#load tidyverse to manipulate data
#load ggplot2 for graphing
#load shiny to...
#load dplyr to manipulate data
#load knitr for general-purpose literate programming
#load kableExtra to add features to table

library(ggrepel) ## For text labels that repel each other instead of overlapping
library(ggiraph)  ## For using geom_tooltip
library(ggiraphExtra)
library(tidyverse)
library(ggplot2)
library(shiny)
library(dplyr)
library(countrycode)
library(knitr)
library(kableExtra)
library(maps)

About Our Data

The data we’ve decided to work with comes from Kaggle and is owned by Diego Enrique. Here’s the link: https://www.kaggle.com/datasets/dgoenrique/hbo-max-movies-and-tv-shows

Titles data:

15 variables, 3030 observations

id: The title ID

title: The name of the title

type: TV show or movie

description: A description of movie or tv show

release_year: Year show/movie was released

age_certification: The age rating of movie or show

runtime: The length of the episode of show or movie in minutes

genres: A list of genres

production_countries: Countries that produced the show/movie

seasons: Number of seasons IF it is a show

imdb_id: The title ID on IMDB

imdb_score: Score on IMDB

imdb_votes: Votes on IMDB

tmdb_popularity: Popularity on TMDB

tmdb_score: Score on TMDB

Credits data:

5 variables, 64879 observations

person_id: The person ID on JustWatch

id: The title ID on JustWatch

name: The name of actor or director

character: The name of the character played in the movie/show

role: ACTOR or DIRECTOR

Let’s read our data, shall we?

We’re using the kable and head functions to show the first few rows of each data set in an organized table.

Here’s our credits.csv

Sample table of credits data

| person_id | id | name | character | role |
|---|---|---|---|---|
| 14701 | tm77588 | Humphrey Bogart | Rick Blaine | ACTOR |
| 14702 | tm77588 | Ingrid Bergman | Ilsa Lund | ACTOR |
| 14703 | tm77588 | Paul Henreid | Victor Laszlo | ACTOR |
| 14704 | tm77588 | Claude Rains | Captain Louis Renault | ACTOR |
| 14705 | tm77588 | Conrad Veidt | Major Heinrich Strasser | ACTOR |
| 14706 | tm77588 | Sydney Greenstreet | Signor Ferrari | ACTOR |

And here’s our titles.csv

Sample table of titles data

| id | title | type | release_year | age_certification | runtime | genres | production_countries | seasons | imdb_id | imdb_score | imdb_votes | tmdb_popularity | tmdb_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tm77588 | Casablanca | MOVIE | 1943 | PG | 102 | ['drama', 'romance', 'war'] | ['US'] | NA | tt0034583 | 8.5 | 577842 | 22.005 | 8.167 |
| tm155702 | The Wizard of Oz | MOVIE | 1939 | G | 102 | ['fantasy', 'family'] | ['US'] | NA | tt0032138 | 8.1 | 406105 | 56.631 | 7.583 |
| tm83648 | Citizen Kane | MOVIE | 1941 | PG | 119 | ['drama'] | ['US'] | NA | tt0033467 | 8.3 | 446627 | 19.900 | 8.022 |
| tm3175 | Meet Me in St. Louis | MOVIE | 1945 | | 113 | ['drama', 'family', 'romance', 'music', 'comedy'] | ['US'] | NA | tt0037059 | 7.5 | 25589 | 8.311 | 7.000 |
| ts225761 | Tom and Jerry | SHOW | 1940 | | 8 | ['animation', 'comedy', 'family', 'action'] | ['US'] | 16 | tt6422744 | 7.7 | 859 | 1.400 | 10.000 |
| tm156463 | Gone with the Wind | MOVIE | 1940 | G | 238 | ['drama', 'romance', 'war', 'history'] | ['US'] | NA | tt0031381 | 8.2 | 319463 | 27.535 | 8.000 |

What if we try to combine these data sets?

both_data <- inner_join(titles, credits, by = "id")

kable(head(both_data),
      align = "c",
      caption = "<b><center>Sample table of both data",
      format = "html") %>% 
  kable_styling(bootstrap_options = "bordered", full_width = FALSE)

Sample table of both data

| id | title | type | release_year | age_certification | runtime | genres | production_countries | seasons | imdb_id | imdb_score | imdb_votes | tmdb_popularity | tmdb_score | person_id | name | character | role |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tm77588 | Casablanca | MOVIE | 1943 | PG | 102 | ['drama', 'romance', 'war'] | ['US'] | NA | tt0034583 | 8.5 | 577842 | 22.005 | 8.167 | 14701 | Humphrey Bogart | Rick Blaine | ACTOR |
| tm77588 | Casablanca | MOVIE | 1943 | PG | 102 | ['drama', 'romance', 'war'] | ['US'] | NA | tt0034583 | 8.5 | 577842 | 22.005 | 8.167 | 14702 | Ingrid Bergman | Ilsa Lund | ACTOR |
| tm77588 | Casablanca | MOVIE | 1943 | PG | 102 | ['drama', 'romance', 'war'] | ['US'] | NA | tt0034583 | 8.5 | 577842 | 22.005 | 8.167 | 14703 | Paul Henreid | Victor Laszlo | ACTOR |
| tm77588 | Casablanca | MOVIE | 1943 | PG | 102 | ['drama', 'romance', 'war'] | ['US'] | NA | tt0034583 | 8.5 | 577842 | 22.005 | 8.167 | 14704 | Claude Rains | Captain Louis Renault | ACTOR |
| tm77588 | Casablanca | MOVIE | 1943 | PG | 102 | ['drama', 'romance', 'war'] | ['US'] | NA | tt0034583 | 8.5 | 577842 | 22.005 | 8.167 | 14705 | Conrad Veidt | Major Heinrich Strasser | ACTOR |
| tm77588 | Casablanca | MOVIE | 1943 | PG | 102 | ['drama', 'romance', 'war'] | ['US'] | NA | tt0034583 | 8.5 | 577842 | 22.005 | 8.167 | 14706 | Sydney Greenstreet | Signor Ferrari | ACTOR |
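
The joined table repeats every Casablanca column once per cast member, which is how inner_join behaves when one title matches several credits. Here is a minimal sketch of that behavior on made-up data; titles_demo and credits_demo are hypothetical stand-ins, not the real files.

```r
library(dplyr)

# Hypothetical miniature versions of the two data sets
titles_demo <- tibble(
  id    = c("tm1", "tm2", "tm3"),
  title = c("Movie A", "Movie B", "Movie C")
)
credits_demo <- tibble(
  id   = c("tm1", "tm1", "tm2", "tm9"),
  name = c("Actor One", "Actor Two", "Actor Three", "Actor Four")
)

# inner_join keeps only ids found in both tables and
# produces one output row per matching title/credit pair
both_demo <- inner_join(titles_demo, credits_demo, by = "id")

# anti_join lists the titles that have no credits and were therefore dropped
unmatched_titles <- anti_join(titles_demo, credits_demo, by = "id")

print(both_demo)
print(unmatched_titles)
```

Because the title row is duplicated for each credit, counting titles on the joined data would overcount; title-level summaries are safer on the titles table alone.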

Let’s begin by determining the number of movies and TV shows we are working with

Type Count
MOVIE 2408
SHOW 622
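
The code behind this table is not shown in the report. One way to get such a count, sketched here on a made-up titles_demo table rather than the real data, is dplyr's count().

```r
library(dplyr)

# Hypothetical stand-in for the titles data
titles_demo <- tibble(
  id   = c("tm1", "tm2", "ts1", "tm3"),
  type = c("MOVIE", "MOVIE", "SHOW", "MOVIE")
)

# count() returns one row per type with its frequency
type_counts <- titles_demo %>%
  count(type, name = "Count") %>%
  rename(Type = type)

print(type_counts)
```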

Wow! That’s a lot more movies than shows! But let’s see it visually.

What’s the distribution of genres for both Shows and Movies in our dataset?

Here’s the table of the number of titles per genre, in descending order
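
The genres column stores each title's genres as one list-like string, so it has to be split before counting. The sketch below assumes the same cleaning approach the report uses for production_countries; titles_demo is a hypothetical sample, and the original genre code is not shown.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical sample with genres stored as list-like strings
titles_demo <- tibble(
  type   = c("MOVIE", "MOVIE", "SHOW"),
  genres = c("['drama', 'romance', 'war']", "['drama']", "['animation', 'comedy']")
)

genre_counts <- titles_demo %>%
  mutate(genres = str_remove_all(genres, "[\\[\\]']")) %>%  # strip brackets and quotes
  separate_rows(genres, sep = ", ") %>%                      # one row per title/genre pair
  filter(genres != "") %>%                                   # drop titles with an empty list
  count(type, genres, name = "total", sort = TRUE)           # most frequent first

print(genre_counts)
```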

Since we just finished observing the number of genres in our dataset, let us see if there’s a correlation between age_certification and genres

## Unique age certifications:  PG, G, PG-13, R, TV-G, TV-Y, TV-Y7, TV-PG, NC-17, TV-14, TV-MA, TV-Y7-FV

Well that did not work as expected. Let’s see if a geom_tile graph does the job:
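
A geom_tile graph needs one row per certification/genre pair with a count to map to the fill color. The following is a rough sketch of that idea on hypothetical data; the variable names and labels are assumptions, not the report's original plotting code.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

# Hypothetical sample of certifications and list-like genre strings
titles_demo <- tibble(
  age_certification = c("PG", "PG", "R", "TV-MA"),
  genres = c("['drama', 'war']", "['drama']", "['thriller']", "['drama', 'crime']")
)

# Count titles for each certification/genre combination
cert_genre <- titles_demo %>%
  filter(!is.na(age_certification), age_certification != "") %>%
  mutate(genres = str_remove_all(genres, "[\\[\\]']")) %>%
  separate_rows(genres, sep = ", ") %>%
  count(age_certification, genres, name = "total")

# Each tile is one combination; darker fill means more titles
cert_genre_plot <- ggplot(cert_genre,
                          aes(x = age_certification, y = genres, fill = total)) +
  geom_tile() +
  labs(x = "Age certification", y = "Genre", fill = "Titles")

print(cert_genre_plot)
```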

Now let’s look at the actors and directors in our credits data.

role n
ACTOR 62158
DIRECTOR 2721

Since actors and directors can have multiple projects, let’s remove the duplicates

Unique name for actors
unique_names_of_actors
43930
Unique name for directors
unique_names_of_directors
1730
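
Both the per-role row counts and the de-duplicated counts can be computed with dplyr. This is only a sketch on a hypothetical credits_demo table. It counts distinct names, as the table headings suggest, although counting distinct person_id values would avoid merging two different people who share a name.

```r
library(dplyr)

# Hypothetical credits sample: Ann appears in two titles
credits_demo <- tibble(
  person_id = c(1, 1, 2, 3, 3),
  id        = c("tm1", "tm2", "tm1", "tm1", "tm3"),
  name      = c("Ann", "Ann", "Bo", "Cy", "Cy"),
  role      = c("ACTOR", "ACTOR", "ACTOR", "DIRECTOR", "DIRECTOR")
)

# Number of credit rows per role
role_counts <- credits_demo %>% count(role)

# Number of distinct names per role, so repeat appearances count once
unique_names <- credits_demo %>%
  group_by(role) %>%
  summarize(unique_names = n_distinct(name), .groups = "drop")

print(role_counts)
print(unique_names)
```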

Are any of these actors/directors in multiple projects? If so, who was in the most projects?

| person_id | name | role | total_projects |
|---|---|---|---|
| 14142 | Grey DeLisle | ACTOR | 60 |
| 529 | Frank Welker | ACTOR | 50 |
| 20372 | Tara Strong | ACTOR | 42 |
| 7997 | Kevin Michael Richardson | ACTOR | 36 |
| 6821 | Fred Tatasciore | ACTOR | 35 |
| 18723 | Dee Bradley Baker | ACTOR | 35 |

| person_id | name | role | total_projects |
|---|---|---|---|
| 21759 | Charlie Chaplin | DIRECTOR | 22 |
| 27098 | Sam Liu | DIRECTOR | 17 |
| 69510 | Jon Alpert | DIRECTOR | 15 |
| 106013 | Yasujirō Ozu | DIRECTOR | 13 |
| 210814 | Satyajit Ray | DIRECTOR | 13 |
| 192306 | Alexandra Pelosi | DIRECTOR | 11 |
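
A ranking like this can be built by counting distinct titles per person and role. The sketch below uses a hypothetical credits_demo table and is one possible approach, not necessarily the one the report used.

```r
library(dplyr)

# Hypothetical credits sample; Ann has two characters in tm1
credits_demo <- tibble(
  person_id = c(1, 1, 1, 1, 2, 3, 3),
  id        = c("tm1", "tm1", "tm2", "tm3", "tm1", "tm1", "tm2"),
  name      = c("Ann", "Ann", "Ann", "Ann", "Bo", "Cy", "Cy"),
  role      = c("ACTOR", "ACTOR", "ACTOR", "ACTOR", "ACTOR", "DIRECTOR", "DIRECTOR")
)

top_people <- credits_demo %>%
  group_by(person_id, name, role) %>%
  # n_distinct(id) counts each title once even with several credits in it
  summarize(total_projects = n_distinct(id), .groups = "drop") %>%
  group_by(role) %>%
  # keep the single busiest person in each role
  slice_max(total_projects, n = 1, with_ties = FALSE) %>%
  ungroup()

print(top_people)
```

Raising n in slice_max() gives a longer leaderboard, such as the top six shown above.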

Here is the number of movies and shows available on HBO by release year

Here’s the summary table of what the graph shows

decade type count
1900s MOVIE 8
1910s MOVIE 12
1920s MOVIE 35
1920s SHOW 1
1930s MOVIE 44
1940s MOVIE 57
1940s SHOW 1
1950s MOVIE 91
1960s MOVIE 130
1960s SHOW 6
1970s MOVIE 109
1970s SHOW 3
1980s MOVIE 170
1980s SHOW 6
1990s MOVIE 265
1990s SHOW 43
2000s MOVIE 395
2000s SHOW 94
2010s MOVIE 706
2010s SHOW 230
2020s MOVIE 274
2020s SHOW 148
NA MOVIE 112
NA SHOW 90

This indicates that HBO primarily features movies and shows from the 2010s.
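
The binning code for the decade table is not shown, so here is one possible way to derive a decade label from release_year, sketched on a hypothetical titles_demo table. With this floor-based rule, a year such as 1940 or 2010 falls into the decade it starts.

```r
library(dplyr)

# Hypothetical sample of release years
titles_demo <- tibble(
  type         = c("MOVIE", "MOVIE", "SHOW", "MOVIE"),
  release_year = c(1940, 1949, 2010, 2019)
)

decade_counts <- titles_demo %>%
  # 1949 -> floor(194.9) * 10 = 1940 -> "1940s"
  mutate(decade = paste0(floor(release_year / 10) * 10, "s")) %>%
  count(decade, type, name = "count")

print(decade_counts)
```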

You can see there is a wide range of movies and TV shows, especially in terms of the year they were released. I wonder what the oldest movie and show are?

Oldest Movie on HBO

| title | release_year | genres |
|---|---|---|
| The Prince of Magicians | 1901 | ['comedy'] |

Oldest Show on HBO

| title | release_year | genres |
|---|---|---|
| Looney Tunes | 1929 | ['comedy', 'family', 'thriller', 'animation'] |
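
Finding the oldest title of each type is a grouped minimum. A minimal sketch, assuming a hypothetical titles_demo table with title, type, and release_year columns:

```r
library(dplyr)

# Hypothetical sample of titles
titles_demo <- tibble(
  title        = c("Old Movie", "New Movie", "Old Show", "New Show"),
  type         = c("MOVIE", "MOVIE", "SHOW", "SHOW"),
  release_year = c(1901, 2020, 1929, 2022)
)

oldest <- titles_demo %>%
  group_by(type) %>%
  # earliest release year within each type; ties broken by row order
  slice_min(release_year, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(type, title, release_year)

print(oldest)
```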

Now let’s explore whether there’s a relationship between a movie’s length and its popularity.
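
One simple way to quantify such a relationship is a correlation coefficient. The sketch below uses hypothetical numbers and assumes tmdb_popularity is the popularity measure of interest; the report does not say which measure it used.

```r
library(dplyr)

# Hypothetical movie runtimes (minutes) and TMDB popularity values
movies_demo <- tibble(
  type            = c("MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW"),
  runtime         = c(90, 120, 150, 200, 30),
  tmdb_popularity = c(10, 20, 15, 30, 50)
)

runtime_popularity <- movies_demo %>%
  filter(type == "MOVIE", !is.na(runtime), !is.na(tmdb_popularity)) %>%
  # Pearson correlation: near 0 means little linear relationship
  summarize(correlation = cor(runtime, tmdb_popularity))

print(runtime_popularity)
```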

Since we’re looking at runtimes, let’s see what HBO’s shortest and longest movies and shows are

Shortest movie on HBO

| title | runtime | release_year | genres |
|---|---|---|---|
| An Impossible Balancing Feat | 1 | 1902 | [] |

Longest movie on HBO

| title | runtime | release_year | genres |
|---|---|---|---|
| Scenes from a Marriage | 299 | 1974 | ['drama', 'european'] |

Shortest Show on HBO

| title | runtime | seasons | release_year | genres |
|---|---|---|---|---|
| Meet the Batwheels | 2 | 1 | 2022 | ['animation', 'action'] |

Longest Show on HBO

| title | runtime | seasons | release_year | genres |
|---|---|---|---|---|
| Sesame Street | 51 | 53 | 1969 | ['comedy', 'animation', 'family', 'fantasy', 'music'] |
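
The report does not show how these extremes were selected, or whether the longest show was ranked by runtime or by number of seasons. The sketch below ranks purely by runtime on a hypothetical titles_demo table, which is an assumption.

```r
library(dplyr)

# Hypothetical sample of runtimes in minutes
titles_demo <- tibble(
  title   = c("Short Film", "Epic Film", "Quick Show", "Long Show"),
  type    = c("MOVIE", "MOVIE", "SHOW", "SHOW"),
  runtime = c(1, 299, 2, 51)
)

runtime_extremes <- titles_demo %>%
  filter(!is.na(runtime)) %>%
  group_by(type) %>%
  # which.min / which.max give the position of the extreme runtime in each group
  summarize(
    shortest         = title[which.min(runtime)],
    shortest_runtime = min(runtime),
    longest          = title[which.max(runtime)],
    longest_runtime  = max(runtime),
    .groups = "drop"
  )

print(runtime_extremes)
```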

Who would’ve known?!

Last but not least, let us look at the number of movies and TV shows by country

Unfortunately, because HBO’s movies and shows come from only some of the world’s countries (the table below has 99 country-and-type rows), some countries on the map are left uncolored

countries <- titles %>%
  # Strip the quotes and square brackets from the list-like string
  mutate(production_countries = str_remove_all(production_countries, "[\\[\\]']")) %>%
  # One row per title/country pair
  separate_rows(production_countries, sep = ", ") %>%
  group_by(production_countries, type) %>%
  summarize(total = n(), .groups = "drop") %>%
  arrange(desc(total))

## Removing rows where the country is unknown
countries <- filter(countries, !is.na(production_countries) & production_countries != "")

## Creating a new column that represents the full country name of production countries
countries$full_country_name <- countrycode(sourcevar = countries$production_countries, origin = "iso2c", destination= "country.name")




kable(countries,
      align = "c",
      caption = "<b><center>Number of movies and TV shows by country",
      format = "html")%>% 
    kable_styling(bootstrap_options = "bordered", full_width = FALSE)
Number of movies and TV shows by country
production_countries type total full_country_name
US MOVIE 1824 United States
US SHOW 491 United States
GB MOVIE 270 United Kingdom
FR MOVIE 178 France
JP MOVIE 112 Japan
DE MOVIE 87 Germany
CA MOVIE 75 Canada
IT MOVIE 54 Italy
GB SHOW 38 United Kingdom
ES MOVIE 28 Spain
MX MOVIE 26 Mexico
AU MOVIE 24 Australia
SE MOVIE 22 Sweden
IN MOVIE 20 India
ES SHOW 19 Spain
CH MOVIE 16 Switzerland
HK MOVIE 14 Hong Kong SAR China
CN MOVIE 13 China
DK MOVIE 13 Denmark
BE MOVIE 12 Belgium
BR SHOW 12 Brazil
NZ MOVIE 12 New Zealand
PL MOVIE 11 Poland
SU MOVIE 11 NA
ZA MOVIE 10 South Africa
AR MOVIE 8 Argentina
AT MOVIE 8 Austria
IE MOVIE 8 Ireland
CA SHOW 7 Canada
JP SHOW 7 Japan
NL MOVIE 7 Netherlands
AE MOVIE 6 United Arab Emirates
AR SHOW 6 Argentina
MX SHOW 6 Mexico
PR MOVIE 6 Puerto Rico
SG SHOW 6 Singapore
TW SHOW 6 Taiwan
BG MOVIE 5 Bulgaria
CO MOVIE 5 Colombia
DE SHOW 5 Germany
FR SHOW 5 France
IL MOVIE 5 Israel
IL SHOW 5 Israel
SN MOVIE 5 Senegal
BR MOVIE 4 Brazil
CZ MOVIE 4 Czechia
DO MOVIE 4 Dominican Republic
GR MOVIE 4 Greece
IT SHOW 4 Italy
KR MOVIE 4 South Korea
XC MOVIE 4 NA
BO MOVIE 3 Bolivia
CL MOVIE 3 Chile
CL SHOW 3 Chile
CU MOVIE 3 Cuba
EC MOVIE 3 Ecuador
HU MOVIE 3 Hungary
IS MOVIE 3 Iceland
NO MOVIE 3 Norway
PT MOVIE 3 Portugal
UY MOVIE 3 Uruguay
CZ SHOW 2 Czechia
DZ MOVIE 2 Algeria
FI MOVIE 2 Finland
ID SHOW 2 Indonesia
IR MOVIE 2 Iran
LU MOVIE 2 Luxembourg
NG MOVIE 2 Nigeria
PE MOVIE 2 Peru
PK MOVIE 2 Pakistan
PL SHOW 2 Poland
RO MOVIE 2 Romania
RO SHOW 2 Romania
RU MOVIE 2 Russia
TH MOVIE 2 Thailand
TR MOVIE 2 Turkey
AF MOVIE 1 Afghanistan
AU SHOW 1 Australia
BS MOVIE 1 Bahamas
CN SHOW 1 China
DK SHOW 1 Denmark
EG MOVIE 1 Egypt
GT MOVIE 1 Guatemala
HU SHOW 1 Hungary
IN SHOW 1 India
KH MOVIE 1 Cambodia
MA MOVIE 1 Morocco
MC MOVIE 1 Monaco
MK MOVIE 1 North Macedonia
NZ SHOW 1 New Zealand
PA MOVIE 1 Panama
PH MOVIE 1 Philippines
PH SHOW 1 Philippines
PY MOVIE 1 Paraguay
RU SHOW 1 Russia
RW MOVIE 1 Rwanda
SG MOVIE 1 Singapore
UA MOVIE 1 Ukraine
UY SHOW 1 Uruguay

# Sum movies and shows so each country has one total, then rename the
# countries whose names differ from the region names used in world_map
country_totals <- countries %>%
  group_by(full_country_name) %>%
  summarize(total = sum(total), .groups = "drop") %>%
  mutate(full_country_name = recode(full_country_name,
                                    "United States" = "USA",
                                    "United Kingdom" = "UK"))

## We're going to try to show this data using a world map
world_map <- map_data("world")

## Let's join our world map and our countries data, keeping every map polygon
world_map_data <- left_join(world_map, country_totals, by = c("region" = "full_country_name"))

## Plotting
ggplot(world_map_data, aes(x = long, y = lat, group = group, fill = total)) +
  geom_polygon() +
  coord_equal() +
  # Scale limits follow the data so the largest total is not dropped
  scale_fill_gradient2(limits = c(0, max(country_totals$total)),
                       low = "lightblue", mid = "blue", high = "darkblue",
                       midpoint = max(country_totals$total) / 2) +
  labs(title = "Number of movies and TV shows by country", fill = "Count") +
  theme_bw()